This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Convert illegal html unicode without glyph to space or zero-width space. #493

Open
wants to merge 1 commit into
base: incoming
from

Conversation

duanyao
Collaborator

@duanyao duanyao commented Mar 21, 2015

This patch converts space-like characters (usually illegal in HTML) in the PDF to a space or ZWSP in the HTML output, rather than to private-use Unicode code points. The sample in #477 now converts perfectly.

This patch depends on CairoFontEngine, so ENABLE_SVG must be on (if it is off, no conversion is done at all).

There are minor modifications to CairoFontEngine.h|cc, which should be merged in a future update.

- << ' ' << CSS::WHITESPACE_CN << wid << "\">" << (target > (threshold - EPS) ? " " : "") << "</span>";
+ << ' ' << CSS::WHITESPACE_CN << wid << "\">";
+ if (target > (threshold - EPS))
+     dump_unicode(out, ' ');
Owner

I think that the space character will bring extra width after the span, which could be unintended?

Collaborator Author

The space is in that span, not after it. This change doesn't change previous behavior, just updates last_output_unicode.

Owner

Ah I see, sorry for the mistake.

@coolwanglu
Owner

Thanks for the patch! Maybe we can also enable it for --space-as-offset, or just wait until such PDF files emerge.

One thing I don't quite like is touching Cairo*: it is not recommended and would make the code harder to maintain.

I feel that this patch adds too many states/variables, making the renderer even more messy. Of course this is due to the current ugly design of HTMLRenderer. I wonder if we may delay this until the overhaul is finished, so that this feature can be implemented as a separate, self-contained pass.

Finally, could you also add a test case for this (maybe after we have no issues left for the code)?

Other possible methods I was thinking of:

  • To detect empty chars using FontForge, but Type 3 fonts might not be supported, and the FontForge API could be messy
  • To detect empty chars by painting them (e.g. with SplashOutputDev), and to check if there's any update, which could be done in the preprocessor

@duanyao
Collaborator Author

duanyao commented Mar 21, 2015

One thing I don't quite like is touching Cairo*: it is not recommended and would make the code harder to maintain.

Then we can copy the necessary code from CairoFontEngine.h|cc and make a dedicated font engine for this task. Actually this engine doesn't have to depend on cairo; it only depends on poppler and freetype. Detecting an empty glyph with freetype seems simple enough and should be much faster than actually drawing the chars.

I feel that this patch adds too many states/variables, making the renderer even more messy.

Yeah, I also plan to simplify it a little in the following days, especially in drawString().

Finally, could you also add a test case for this (maybe after we have no issues left for the code)?

The sample file in #477 was used to test this patch.

Unfortunately I found a problem in browsers: Firefox and WebKit disagree on whether to add letter-spacing around ZWSP:
<div style="font-size:30px;letter-spacing:10px">M&#x200B;M</div>
This means we can't output ZWSP if letter-spacing is not zero. Can we just omit ZWSP in this case? It should be rare.
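
To make the proposed rule concrete, here is a minimal sketch. The function name and the `eps` tolerance are hypothetical and not part of the patch; it only encodes "emit ZWSP when letter-spacing is effectively zero, otherwise emit nothing".

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of pdf2htmlEX.
// Returns true if a zero-width empty glyph may be written as ZWSP (U+200B).
// When letter-spacing is non-zero, browsers disagree on how ZWSP is spaced,
// so the caller would emit nothing instead.
bool should_emit_zwsp(double letter_spacing, double eps = 1e-6)
{
    assert(eps > 0.0);
    return std::fabs(letter_spacing) < eps;
}
```

The tolerance mirrors the EPS comparison used elsewhere in the renderer, but its value here is only an assumption.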

@coolwanglu
Owner

Further, it seems that this is better done in the Preprocessor. Checking whether a glyph is empty might be expensive, so we could check each glyph used in the document first and store the results along with the mapping info.
But again, it would be better to do so after the overhaul.
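
A rough sketch of that idea, assuming the expensive check is available as a callback. All names below (`EmptyGlyphCache`, `FontId`, `Probe`) are hypothetical and do not exist in pdf2htmlEX; the probe could, for example, be backed by a freetype check that the loaded outline has no contours.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Hypothetical names throughout; none of these exist in pdf2htmlEX.
using FontId = long long;
using CharCode = unsigned int;

// Caches the (possibly expensive) "is this glyph empty?" answer per (font, code),
// so each glyph used in the document is probed at most once.
class EmptyGlyphCache
{
public:
    // The probe performs the real check, e.g. loading the glyph and inspecting it.
    using Probe = std::function<bool(FontId, CharCode)>;

    explicit EmptyGlyphCache(Probe probe) : probe_(std::move(probe))
    {
        assert(static_cast<bool>(probe_)); // a probe function is required
    }

    // Returns the cached answer; calls the probe only the first time a pair is seen.
    bool is_empty(FontId font, CharCode code)
    {
        const auto key = std::make_pair(font, code);
        auto it = cache_.find(key);
        if (it == cache_.end())
            it = cache_.emplace(key, probe_(font, code)).first;
        return it->second;
    }

    // Number of distinct (font, code) pairs probed so far.
    std::size_t size() const { return cache_.size(); }

private:
    Probe probe_;
    std::map<std::pair<FontId, CharCode>, bool> cache_;
};
```

The renderer would then only do a map lookup per character, and the cache could be filled during preprocessing.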

@duanyao
Collaborator Author

duanyao commented Mar 22, 2015

OK, I agree.

@duanyao
Collaborator Author

duanyao commented Mar 23, 2015

After some thought, I think most (if not all) "illegal html unicode without glyph" should be converted to a <span class="_ _N">&#x20;</span>, because:

  • Browsers implement letter-spacing for ZWSP etc. differently (it should not apply; FF and IE have bugs), while they implement letter-spacing for inline-block consistently (it does not apply at all). So converting an illegal unicode to a ZWSP + offset pair is not ideal.
  • In order to largely preserve the semantics, most illegal html unicodes should be converted to delimiters (a space or something similar), not just empty spans. P.S. I think if --space-as-offset is on, at least one space should also be retained for each converted offset.

To accomplish this, HTMLTextLine::append_offset(double width) may be extended to HTMLTextLine::append_offset(double width, Unicode c), meaning that a char (in most cases a space) can be associated with the offset. This is called a "mixed offset".

The HTMLTextLine::text field can also be changed to type vector<HTMLTextUnit>, with HTMLTextUnit defined as:

struct HTMLTextUnit
{
  HTMLTextState * state; // or int state, index into HTMLTextLine::states
  int char_count;        // how many CharCodes this unit corresponds to
  Unicode unicode;       // code point to output; 0 means none
  float width;           // width of the offset; NaN means defined by font/state
};

The type of an HTMLTextUnit is implied by its fields:

  • Normal text: char_count == 1, unicode > 0, width is NaN (means defined by font/state)
  • Pure offset: char_count == 0, unicode == 0, width != 0
  • Mixed offset: char_count > 0, unicode == 0x20 (may be extended), width is not NaN
  • Repeating spaces: char_count > 1, unicode == 0x20, width is NaN
  • Decomposed ligature
    • First: same as normal text
    • Following units: char_count == 0, unicode > 0
  • Not in use: char_count == 0, unicode == 0, width == 0
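
As a sanity check of the rules above, here is a sketch of how the implied type could be recovered from the fields. `UnitKind` and `classify` are hypothetical names, `Unicode` is a stand-in typedef for poppler's type, and `state` uses the int-index variant; any combination the list does not cover is reported as `Unknown`.

```cpp
#include <cassert>
#include <cmath>

using Unicode = unsigned int; // assumption: stand-in for poppler's Unicode type

struct HTMLTextUnit
{
    int state;       // index into HTMLTextLine::states (the "int state" variant)
    int char_count;  // how many CharCodes this unit corresponds to
    Unicode unicode; // code point to output; 0 means none
    float width;     // width of the offset; NaN means defined by font/state
};

// Hypothetical enum naming the implied unit types.
enum class UnitKind { Normal, PureOffset, MixedOffset, RepeatingSpaces, LigatureTail, NotInUse, Unknown };

UnitKind classify(const HTMLTextUnit & u)
{
    assert(u.char_count >= 0);
    const bool width_from_font = std::isnan(u.width);

    if (u.char_count == 0) {
        // Following units of a decomposed ligature carry a code point but no CharCode.
        if (u.unicode > 0) return UnitKind::LigatureTail;
        if (width_from_font) return UnitKind::Unknown;
        return u.width == 0 ? UnitKind::NotInUse : UnitKind::PureOffset;
    }
    // char_count > 0 from here on.
    if (!width_from_font)
        return u.unicode == 0x20 ? UnitKind::MixedOffset : UnitKind::Unknown;
    if (u.char_count == 1)
        return u.unicode > 0 ? UnitKind::Normal : UnitKind::Unknown;
    return u.unicode == 0x20 ? UnitKind::RepeatingSpaces : UnitKind::Unknown;
}
```

Note that the first unit of a decomposed ligature is indistinguishable from normal text here, which matches the list.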

Some notes:

  • char_count can be used to sync with CoveredTextDetector
  • Pure/mixed offsets may be merged during text optimization, with char_count and width added up. After merging, a removed HTMLTextUnit can be marked as "not in use" (the vector doesn't have to be compacted).
  • Large offsets should not be converted to "repeating spaces" during text optimization, since this would invalidate char_count. Large offsets may instead be converted to multiple normal text units.
  • HTMLTextLine::offsets and HTMLTextLine::decomposed_text are not needed anymore.
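
The merging note could look roughly like the sketch below. `is_offset` and `merge_adjacent_offsets` are hypothetical names, `Unicode` is a stand-in typedef, and merging only within the same `state` is my own conservative assumption rather than something decided in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Unicode = unsigned int; // assumption: stand-in for poppler's Unicode type

struct HTMLTextUnit
{
    int state;       // index into HTMLTextLine::states
    int char_count;  // how many CharCodes this unit corresponds to
    Unicode unicode; // code point to output; 0 means none
    float width;     // width of the offset; NaN means defined by font/state
};

// True for pure offsets (no code point, non-zero width) and mixed offsets
// (a space carrying an explicit width).
bool is_offset(const HTMLTextUnit & u)
{
    assert(u.char_count >= 0);
    if (std::isnan(u.width)) return false;
    const bool pure = u.char_count == 0 && u.unicode == 0 && u.width != 0;
    const bool mixed = u.char_count > 0 && u.unicode == 0x20;
    return pure || mixed;
}

// Folds each run of adjacent offsets into its first unit, adding up char_count
// and width. Absorbed units are marked "not in use"; the vector is not compacted.
void merge_adjacent_offsets(std::vector<HTMLTextUnit> & units)
{
    HTMLTextUnit * run = nullptr; // first unit of the current run, if any
    for (auto & u : units) {
        if (!is_offset(u)) { run = nullptr; continue; }
        if (run == nullptr || run->state != u.state) { run = &u; continue; }
        run->char_count += u.char_count;
        run->width += u.width;
        if (run->unicode == 0) run->unicode = u.unicode; // keep the space if any unit had one
        u = HTMLTextUnit{u.state, 0, 0u, 0.0f};          // mark as "not in use"
    }
}
```

A pure offset merged with a mixed offset ends up as a mixed offset, so the associated space is retained.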

@coolwanglu coolwanglu self-assigned this Jun 23, 2015
jwuttke added a commit to jwuttke/pdf2htmlEX that referenced this pull request Sep 29, 2016
2 participants